07. DBSCAN Algorithm

Original data on the left and clusters identified by the DBSCAN algorithm on the right. For DBSCAN clusters, large colored points represent core cluster members, small colored points represent cluster edge members, and small black points represent outliers.

The DBSCAN Algorithm

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This algorithm is a nice alternative to k-means when you don't know how many clusters to expect in your data, but you do know something about how the points should be clustered in terms of density (distance between points in a cluster).

The DBSCAN algorithm creates clusters by grouping data points that lie within some threshold distance d_{th} of at least one other point in the data.

The algorithm is sometimes also called “Euclidean Clustering”, because the decision of whether to place a point in a particular cluster is based upon the “Euclidean distance” between that point and other cluster members.

You can think of Euclidean distance as the length of a line segment connecting two points, but the coordinates defining the positions of points in your data need not be spatial coordinates. They could be defined in color space, for example, or in any other feature space of the data.

For points \bold{p} and \bold{q} in an n-dimensional dataset, where the position of \bold{p} is defined by coordinates (p_1, p_2, …, p_n) and the position of \bold{q} is defined by (q_1, q_2, …, q_n), the Euclidean distance between the two points is just:

d(\bold{p}, \bold{q}) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2}
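
In NumPy, for example, this distance is a one-liner. Here's a quick sketch with made-up points; the coordinates could just as well be color-space values or any other features:

import numpy as np

# Two example points in a 3-dimensional feature space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Euclidean distance: sqrt((4-1)^2 + (6-2)^2 + (3-3)^2) = 5.0
dist = np.linalg.norm(q - p)
print(dist)  # 5.0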

DBSCAN Clustering Steps:

Suppose you have a set P of n data points p_1, p_2, …, p_n:

  • Set a constraint for the minimum number of points that comprise a cluster (min_samples)
  • Set a distance threshold, i.e., the maximum distance between cluster points (max_dist)
  • For every point p_i in P, do:
    • if p_i has at least one neighbor within max_dist:
      • if one of p_i's neighbors is already part of a cluster:
        • add p_i to that cluster
      • else:
        • create a new cluster containing p_i
      • if p_i has at least min_samples - 1 neighbors within max_dist:
        • p_i becomes a "core member" of the cluster
      • else:
        • p_i becomes an "edge member" of the cluster
    • else:
      • p_i is defined as an outlier
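
Rendered as a quick Python sketch, these steps might look like the following (illustrative only; it assumes the data are rows of an n-by-2 NumPy array, and the name naive_dbscan is made up here):

import numpy as np

def naive_dbscan(points, max_dist, min_samples):
    # Simplified, single-pass rendering of the steps above (illustration only)
    labels = [None] * len(points)  # cluster id per point; -1 marks an outlier
    core = [False] * len(points)   # True where a point is a core member
    next_label = 0
    for i, p in enumerate(points):
        # Indices of all other points within max_dist of p
        dists = np.linalg.norm(points - p, axis=1)
        neighbors = [j for j in np.where(dists <= max_dist)[0] if j != i]
        if not neighbors:
            labels[i] = -1  # no neighbors at all: outlier
            continue
        # Join a neighbor's existing cluster if there is one, else start a new one
        neighbor_clusters = [labels[j] for j in neighbors
                             if labels[j] is not None and labels[j] != -1]
        if neighbor_clusters:
            labels[i] = neighbor_clusters[0]
        else:
            labels[i] = next_label
            next_label += 1
        # Core member when neighbors plus the point itself reach min_samples
        core[i] = len(neighbors) >= min_samples - 1
    return labels, core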

The above code is a bit simplified to aid a beginner's understanding of the algorithm. If you visited every data point in random order, you would end up creating significantly more clusters than needed and would then have to merge them somehow. We can avoid this by controlling the order in which we visit data points.

Upon creating a new cluster with our first qualifying data point, we add all of its neighbors to the cluster. We then add its neighbors' neighbors, and their neighbors, until we have visited every data point that belongs to this cluster. Only after doing so do we move on to another randomly chosen data point. This way, we can guarantee that a cluster is complete and that no points belonging to it remain unvisited in the dataset.
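
One way to picture that expansion is a breadth-first walk over neighbors. Below is a sketch; neighbors_of is a hypothetical helper that returns the indices of all points within max_dist of point i:

from collections import deque

def expand_cluster(seed_idx, labels, cluster_id, neighbors_of):
    # Breadth-first walk: label the seed, then its neighbors, then their
    # neighbors, until no unvisited member of the cluster remains
    queue = deque([seed_idx])
    labels[seed_idx] = cluster_id
    while queue:
        i = queue.popleft()
        for j in neighbors_of(i):  # hypothetical helper: neighbor indices within max_dist
            if labels[j] is None:  # visit each point at most once
                labels[j] = cluster_id
                queue.append(j)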

The full pseudo-code implementation of the DBSCAN algorithm can be found on the DBSCAN Wikipedia page.

Here's how the outcome looks visually for an example where min_samples = 4:

Blue points are core cluster members having at least three neighbors within max_dist (satisfying number of neighbors plus self >= min_samples). Yellow points are edge members having neighbors, but fewer than three. Black points are outliers having no neighbors.

It is once again worth visiting Naftali Harris' site to experiment with the interactive DBSCAN visualizer that he created.

While there, I recommend taking a look at Example A, as it demonstrates an interesting corner case where one data point falls into more than one cluster.

DBSCAN Playground

In this exercise, I've built you a sandbox to play with the DBSCAN implementation from scikit-learn. Let's take a look at the code used to implement this. First off, you'll be using the same cluster_gen() function as in the k-means exercise; this time you can find it in the extra_functions.py tab below.

As far as new code goes, you've got the following:

# Import DBSCAN()
from sklearn.cluster import DBSCAN

After generating some data, you're ready to run DBSCAN():

# Define max_distance (eps parameter in DBSCAN())
max_distance = 1.5
db = DBSCAN(eps=max_distance, min_samples=10).fit(data)

You can then extract positions and labels from the result for plotting:

# Extract a mask of core cluster members
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
# Extract labels (-1 is used for outliers)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
unique_labels = set(labels)

And the rest is just visualizing the result! Play around and see what happens as you vary parameters associated with the data you generate as well as varying max_distance and min_samples.
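
For instance, one quick way to survey the parameter space (a sketch reusing the data and DBSCAN import from above) is to loop over a few combinations and count the clusters found:

# Count clusters found for a few (eps, min_samples) combinations
for eps in (0.5, 1.0, 2.0):
    for min_samples in (5, 10, 20):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(data).labels_
        n_found = len(set(labels)) - (1 if -1 in labels else 0)
        print('eps=%.1f, min_samples=%d: %d clusters' % (eps, min_samples, n_found))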


import numpy as np
import matplotlib.pyplot as plt
import cv2
from sklearn.cluster import DBSCAN
from extra_functions import cluster_gen

# Generate some clusters!
n_clusters = 50
clusters_x, clusters_y = cluster_gen(n_clusters)
# Convert to a single dataset in OpenCV format
data = np.float32((np.concatenate(clusters_x), np.concatenate(clusters_y))).transpose()
# Define max_distance (eps parameter in DBSCAN())
max_distance = 1
db = DBSCAN(eps=max_distance, min_samples=10).fit(data)
# Extract a mask of core cluster members
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
# Extract labels (-1 is used for outliers)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
unique_labels = set(labels)

# Plot up the results!
min_x = np.min(data[:, 0])
max_x = np.max(data[:, 0])
min_y = np.min(data[:, 1])
max_y = np.max(data[:, 1])

fig = plt.figure(figsize=(12,6))
plt.subplot(121)
plt.plot(data[:,0], data[:,1], 'ko')
plt.xlim(min_x, max_x)
plt.ylim(min_y, max_y)
plt.title('Original Data', fontsize = 20)

plt.subplot(122)
# The following is just a fancy way of plotting core, edge and outliers
# Credit to: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = data[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=7)

    xy = data[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=3)
plt.xlim(min_x, max_x)
plt.ylim(min_y, max_y)
plt.title('DBSCAN: %d clusters found' % n_clusters, fontsize = 20)
fig.tight_layout()
plt.subplots_adjust(left=0.03, right=0.98, top=0.9, bottom=0.05)
extra_functions.py:

import numpy as np

# Define a function to generate clusters
def cluster_gen(n_clusters, pts_minmax=(10, 100), x_mult=(1, 4), y_mult=(1, 3), 
                x_off=(0, 50), y_off=(0, 50)):
    
    # n_clusters = number of clusters to generate
    # pts_minmax = range of number of points per cluster 
    # x_mult = range of multiplier to modify the size of cluster in the x-direction
    # y_mult = range of multiplier to modify the size of cluster in the y-direction
    # x_off = range of cluster position offset in the x-direction
    # y_off = range of cluster position offset in the y-direction
    
    # Initialize some empty lists to receive cluster member positions
    clusters_x = []
    clusters_y = []
    # Generate random values given parameter ranges
    n_points = np.random.randint(pts_minmax[0], pts_minmax[1], n_clusters)
    x_multipliers = np.random.randint(x_mult[0], x_mult[1], n_clusters)
    y_multipliers = np.random.randint(y_mult[0], y_mult[1], n_clusters)
    x_offsets = np.random.randint(x_off[0], x_off[1], n_clusters)
    y_offsets = np.random.randint(y_off[0], y_off[1], n_clusters)
     
    # Generate random clusters given parameter values
    for idx, npts in enumerate(n_points):
        
        xpts = np.random.randn(npts) * x_multipliers[idx] + x_offsets[idx]
        ypts = np.random.randn(npts) * y_multipliers[idx] + y_offsets[idx]
        clusters_x.append(xpts)
        clusters_y.append(ypts)
    
    # Return cluster positions
    return clusters_x, clusters_y